Import data

library(tidyverse)  # read_delim(), %>%, select(), mutate(), sample_n()
library(plotly)     # plot_ly(), layout()
library(leaflet)    # leaflet(), addTiles()

met_plot <- read_delim("/Users/aleya/Library/CloudStorage/OneDrive-cumc.columbia.edu/Coursework/Data Science I/data/MetObjects.txt") %>%  ## TODO: change this to a project-relative path later
  janitor::clean_names() %>%
  select(object_id, object_name, title, accession_year,
         culture, period, department, is_highlight,
         geography_type, city, state, county, country, region, subregion)

## Work with a random sample of 5,000 rows; adjust the size if the plots are too heavy
met_plot <- sample_n(met_plot, 5000)

Clean up the object_name

The object_name variable is often more detailed than necessary, so here I collapse it into more general categories of objects.

## Patterns are case-sensitive; dropping the first letter (e.g. "arring")
## matches both "Earring" and "earring". The first matching rule wins.
met_plot <- met_plot %>%
  mutate(object_name = case_when(
    grepl("Textile", object_name)      ~ "Textile",
    grepl("Painting", object_name)     ~ "Painting",
    grepl("Relief", object_name)       ~ "Relief",
    grepl("Print", object_name)        ~ "Print",
    grepl("aseball card", object_name) ~ "Baseball card",
    grepl("Vase", object_name)         ~ "Vase",
    grepl("rnament", object_name)      ~ "Ornament",
    grepl("arring", object_name)       ~ "Earring",
    grepl("ecklace", object_name)      ~ "Necklace",
    grepl("hotograph", object_name)    ~ "Photograph",
    grepl("tatue", object_name)        ~ "Statue",
    TRUE                               ~ object_name  ## keep everything else as is
  ))
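
To make the matching behavior concrete, here is a minimal sketch on a few made-up object names (the `toy_names` vector is hypothetical, not taken from the Met data). It illustrates three things about this kind of recoding: `grepl()` is case-sensitive, the first matching rule wins, and missing names stay missing.

```r
library(dplyr)

# Made-up object names for illustration only
toy_names <- c("Textile fragment", "Pair of earrings",
               "Baseball card, print", "Bowl", NA)

recoded <- case_when(
  grepl("Textile", toy_names)      ~ "Textile",
  grepl("Print", toy_names)        ~ "Print",         # capital P: does not match "print"
  grepl("aseball card", toy_names) ~ "Baseball card",
  grepl("arring", toy_names)       ~ "Earring",       # matches "Earring" and "earring"
  TRUE                             ~ toy_names        # anything else is left unchanged
)

recoded
```

Running `table(met_plot$object_name)` after the real recoding is a quick way to spot names that slipped through because of capitalization.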

Data checks

Note: the geography variables have substantial missingness (roughly 83% to over 99% of rows in this sample). We may want to limit the data to objects with geographic data, but that would give a biased picture. The accession_year variable, however, is nearly complete (under 1% missing), and culture is moderately complete (about 56% missing). Here is a table of the proportion of rows with missing values in each selected column:

sapply(met_plot, function(x) mean(is.na(x)))  ## proportion missing; does not hard-code the sample size
##      object_id    object_name          title accession_year        culture 
##         0.0000         0.0042         0.0670         0.0088         0.5550 
##         period     department   is_highlight geography_type           city 
##         0.8010         0.0000         0.0000         0.8708         0.9310 
##          state         county        country         region      subregion 
##         0.9952         0.9834         0.8346         0.9330         0.9528
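
One way to gauge how biased a geography-only subset would be is to check whether the share of rows with a recorded country differs by department. The sketch below uses a small made-up table (`toy` is a hypothetical stand-in for `met_plot`, and its values are invented), so the numbers say nothing about the real data.

```r
library(dplyr)

# Made-up stand-in for met_plot with only the columns needed here
toy <- tibble(
  department = c("Asian Art", "Asian Art",
                 "Drawings and Prints", "Drawings and Prints",
                 "Drawings and Prints", "Egyptian Art"),
  country    = c("China", NA, NA, NA, "France", "Egypt")
)

# Share of rows with a non-missing country, by department
geo_share <- toy %>%
  group_by(department) %>%
  summarize(
    n = n(),
    share_with_country = mean(!is.na(country)),
    .groups = "drop"
  )

geo_share
```

If the shares vary widely across departments, restricting to rows with geographic data would over-represent some departments on the map.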

Set color palette based on Wes Anderson package

mypal<-c("#78B7C5",  "#EBCC2A", "#FF0000", "#EABE94", 
         "#3B9AB2", "#B40F20", "#0B775E", "#F2300F", 
         "#5BBCD6", "#F98400", "#ab0213", "#E2D200", 
         "#ff7700", "#46ACC8", "#00A08A", "#78B7C5",
         "#a7ba42", "#f94f8a", "#DD8D29")

Plot 1: Line chart showing number of objects acquired by department over time

met_plot %>%
  group_by(department, accession_year) %>%
  summarize(n = n(), .groups = "drop") %>%  ## one row per department-year; drop grouping to avoid the summarise() message
  plot_ly(x = ~accession_year, y = ~n, 
          color = ~department,  
          type = 'scatter', 
          mode = 'lines+markers', 
          colors = mypal) %>% 
  layout(showlegend = FALSE)

Plot 3: Map

Will play with this next. Need to merge the country names with a country-level shapefile.

met_plot %>% 
  filter(!is.na(country)) %>%
  leaflet() %>%
  addTiles()
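
As a rough sketch of the merge described above: count objects per country, join the counts onto the attribute table of a country-level shapefile, and list the country names that fail to match. Everything here is an assumption for illustration. `toy_objects` stands in for `met_plot`, and `shape_attrs` is a hypothetical attribute table; a real shapefile will have its own column names and spellings.

```r
library(dplyr)

# Made-up object records standing in for met_plot
toy_objects <- tibble(
  object_id = 1:6,
  country   = c("Egypt", "Egypt", "France", "United States", "Iran", NA)
)

# Hypothetical attribute table of a country-level shapefile
# (column names `name` and `iso_a3` are assumptions)
shape_attrs <- tibble(
  name   = c("Egypt", "France", "United States of America", "Iran"),
  iso_a3 = c("EGY", "FRA", "USA", "IRN")
)

# Number of objects per country, ignoring rows with no country
country_counts <- toy_objects %>%
  filter(!is.na(country)) %>%
  count(country, name = "n_objects")

# Attach counts to the shapefile attributes; unmatched countries get NA
joined <- shape_attrs %>%
  left_join(country_counts, by = c("name" = "country"))

# Country names in the data with no counterpart in the shapefile
unmatched <- country_counts %>%
  anti_join(shape_attrs, by = c("country" = "name"))

unmatched
```

The `anti_join()` step is the useful part in practice: spelling differences such as "United States" versus "United States of America" would otherwise silently drop objects from the map.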